Your final is due by the end of the last week of class. You should post your solutions to your GitHub account or RPubs. You are also expected to make a short presentation via YouTube and post that recording to the board. This project will show off your ability to understand the elements of the class.
Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6.
set.seed(123)
N <- runif(1, 8, 1000)
n <- 10000
X <- runif(n, min = 0, max = N)
Then generate a random variable Y that has 10,000 random normal numbers with a mean of \(μ=σ=(N+1)/2\).
set.seed(123)
me <- (N+1)/2
sd <- me
Y <- rnorm(n, mean = me, sd = sd)
Probability. Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.
(x <- median(X))
## [1] 145.0453
(y <- quantile(Y, 0.25)[[1]])
## [1] 48.85452
\[P(X>x | X>y) = \frac{P(X>145.0453) \quad& P(X>48.85452) }{P(X>48.85452)}\]
(p_a <- sum(X>x & X>y)/n)
## [1] 0.5
(p_xy <- sum(X>y)/n)
## [1] 0.8318
(p_a/p_xy)
## [1] 0.601106
0.6 is the probability that X is greater than its median given that X is greater than the first quartile of Y.
\[P(X>x , X>y) = {P(X>145.0453) . P(Y>48.85452) }\]
(p_b <- sum(X>x & Y>y)/n)
## [1] 0.369
0.36 is the probability that X and Y are greater than all possible x and y.
\[P(X<x | X>y) = \frac{P(X<145.0453) . P(X>48.85452) }{P(X>48.85452)}\]
(p_c <- sum(X<x & X>y)/n)
## [1] 0.3318
(p_xy <- sum(X>y)/n)
## [1] 0.8318
(p_c/p_xy)
## [1] 0.398894
0.39 is the probability of X less than its median and greater than the first quantile of Y.
Investigate whether \(P(X>x \quad and \quad Y>y)=P(X>x)P(Y>y)\) by building a table and evaluating the marginal and joint probabilities.
(res <- matrix(c(sum(X>x & Y<y),sum(X>x & Y>y), sum(X<x & Y<y),sum(X<x & Y>y)), ncol = 2, nrow = 2))
## [,1] [,2]
## [1,] 1310 1190
## [2,] 3690 3810
res <- cbind(res,c(res[1,1] + res[1,2], res[2,1] + res[2,2]))
res <- rbind(res,c(res[1,1] + res[2,1], res[1,2] + res[2,2], res[1,3] + res[2,3]))
(results <- as.data.frame(res))
results
colnames(results) <- c("X>x", "X<x", "total")
rownames(results) <- c("Y<y", "Y>y", "total")
results
Probability matrix
(prob_tab <- results/n)
Check \(P(X>x \quad and \quad Y>y)=P(X>x)P(Y>y)\) Check the right side : \(P(X>x)P(Y>y)\) from the table we get
(round(0.5*0.75, 2))
## [1] 0.38
Check the leftside: \(P(X>x \quad and \quad Y>y)\) from the table = 0.369 ~ 0.38
Since the results are so similar we can conclude that both X and Y are independent variable
Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?
fisher.test(results,simulate.p.value=TRUE)
##
## Fisher's Exact Test for Count Data with simulated p-value (based
## on 2000 replicates)
##
## data: results
## p-value = 0.1079
## alternative hypothesis: two.sided
chisq.test(results, correct=TRUE)
##
## Pearson's Chi-squared test
##
## data: results
## X-squared = 7.68, df = 4, p-value = 0.104
“Fisher’s exact test” is a test for comparing two proportions that is conditional on the marginal frequencies. The test is exact in guaranteeing that the type I error is less than or equal to what you specify. It can however be much less. P-values are too large and power is lost. In this sense ordinary chi-square tests are generally “more accurate” than “exact” tests. In addition, the old adage that chi-square tests are not accurate if an expected cell frequency is less than 5 is not correct.
library('ggplot2')
library('ggthemes')
library('scales')
library('dplyr')
library('mice')
library('randomForest')
library('data.table')
library('gridExtra')
library('corrplot')
library('GGally')
library('e1071')
loading dataset
library(RCurl)
## Loading required package: bitops
##
## Attaching package: 'RCurl'
## The following object is masked from 'package:mice':
##
## complete
train <- read.csv('https://raw.githubusercontent.com/salma71/houses_prices_kaggle/master/datasets/train.csv', stringsAsFactors = F)
head(train)
colnames(train)
## [1] "Id" "MSSubClass" "MSZoning" "LotFrontage"
## [5] "LotArea" "Street" "Alley" "LotShape"
## [9] "LandContour" "Utilities" "LotConfig" "LandSlope"
## [13] "Neighborhood" "Condition1" "Condition2" "BldgType"
## [17] "HouseStyle" "OverallQual" "OverallCond" "YearBuilt"
## [21] "YearRemodAdd" "RoofStyle" "RoofMatl" "Exterior1st"
## [25] "Exterior2nd" "MasVnrType" "MasVnrArea" "ExterQual"
## [29] "ExterCond" "Foundation" "BsmtQual" "BsmtCond"
## [33] "BsmtExposure" "BsmtFinType1" "BsmtFinSF1" "BsmtFinType2"
## [37] "BsmtFinSF2" "BsmtUnfSF" "TotalBsmtSF" "Heating"
## [41] "HeatingQC" "CentralAir" "Electrical" "X1stFlrSF"
## [45] "X2ndFlrSF" "LowQualFinSF" "GrLivArea" "BsmtFullBath"
## [49] "BsmtHalfBath" "FullBath" "HalfBath" "BedroomAbvGr"
## [53] "KitchenAbvGr" "KitchenQual" "TotRmsAbvGrd" "Functional"
## [57] "Fireplaces" "FireplaceQu" "GarageType" "GarageYrBlt"
## [61] "GarageFinish" "GarageCars" "GarageArea" "GarageQual"
## [65] "GarageCond" "PavedDrive" "WoodDeckSF" "OpenPorchSF"
## [69] "EnclosedPorch" "X3SsnPorch" "ScreenPorch" "PoolArea"
## [73] "PoolQC" "Fence" "MiscFeature" "MiscVal"
## [77] "MoSold" "YrSold" "SaleType" "SaleCondition"
## [81] "SalePrice"
Identify the target variable which is the SalePrice
target <- train$SalePrice
summary(target)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34900 129975 163000 180921 214000 755000
dim(train)
## [1] 1460 81
str(train)
## 'data.frame': 1460 obs. of 81 variables:
## $ Id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ MSSubClass : int 60 20 60 70 60 50 20 60 50 190 ...
## $ MSZoning : chr "RL" "RL" "RL" "RL" ...
## $ LotFrontage : int 65 80 68 60 84 85 75 NA 51 50 ...
## $ LotArea : int 8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
## $ Street : chr "Pave" "Pave" "Pave" "Pave" ...
## $ Alley : chr NA NA NA NA ...
## $ LotShape : chr "Reg" "Reg" "IR1" "IR1" ...
## $ LandContour : chr "Lvl" "Lvl" "Lvl" "Lvl" ...
## $ Utilities : chr "AllPub" "AllPub" "AllPub" "AllPub" ...
## $ LotConfig : chr "Inside" "FR2" "Inside" "Corner" ...
## $ LandSlope : chr "Gtl" "Gtl" "Gtl" "Gtl" ...
## $ Neighborhood : chr "CollgCr" "Veenker" "CollgCr" "Crawfor" ...
## $ Condition1 : chr "Norm" "Feedr" "Norm" "Norm" ...
## $ Condition2 : chr "Norm" "Norm" "Norm" "Norm" ...
## $ BldgType : chr "1Fam" "1Fam" "1Fam" "1Fam" ...
## $ HouseStyle : chr "2Story" "1Story" "2Story" "2Story" ...
## $ OverallQual : int 7 6 7 7 8 5 8 7 7 5 ...
## $ OverallCond : int 5 8 5 5 5 5 5 6 5 6 ...
## $ YearBuilt : int 2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
## $ YearRemodAdd : int 2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
## $ RoofStyle : chr "Gable" "Gable" "Gable" "Gable" ...
## $ RoofMatl : chr "CompShg" "CompShg" "CompShg" "CompShg" ...
## $ Exterior1st : chr "VinylSd" "MetalSd" "VinylSd" "Wd Sdng" ...
## $ Exterior2nd : chr "VinylSd" "MetalSd" "VinylSd" "Wd Shng" ...
## $ MasVnrType : chr "BrkFace" "None" "BrkFace" "None" ...
## $ MasVnrArea : int 196 0 162 0 350 0 186 240 0 0 ...
## $ ExterQual : chr "Gd" "TA" "Gd" "TA" ...
## $ ExterCond : chr "TA" "TA" "TA" "TA" ...
## $ Foundation : chr "PConc" "CBlock" "PConc" "BrkTil" ...
## $ BsmtQual : chr "Gd" "Gd" "Gd" "TA" ...
## $ BsmtCond : chr "TA" "TA" "TA" "Gd" ...
## $ BsmtExposure : chr "No" "Gd" "Mn" "No" ...
## $ BsmtFinType1 : chr "GLQ" "ALQ" "GLQ" "ALQ" ...
## $ BsmtFinSF1 : int 706 978 486 216 655 732 1369 859 0 851 ...
## $ BsmtFinType2 : chr "Unf" "Unf" "Unf" "Unf" ...
## $ BsmtFinSF2 : int 0 0 0 0 0 0 0 32 0 0 ...
## $ BsmtUnfSF : int 150 284 434 540 490 64 317 216 952 140 ...
## $ TotalBsmtSF : int 856 1262 920 756 1145 796 1686 1107 952 991 ...
## $ Heating : chr "GasA" "GasA" "GasA" "GasA" ...
## $ HeatingQC : chr "Ex" "Ex" "Ex" "Gd" ...
## $ CentralAir : chr "Y" "Y" "Y" "Y" ...
## $ Electrical : chr "SBrkr" "SBrkr" "SBrkr" "SBrkr" ...
## $ X1stFlrSF : int 856 1262 920 961 1145 796 1694 1107 1022 1077 ...
## $ X2ndFlrSF : int 854 0 866 756 1053 566 0 983 752 0 ...
## $ LowQualFinSF : int 0 0 0 0 0 0 0 0 0 0 ...
## $ GrLivArea : int 1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
## $ BsmtFullBath : int 1 0 1 1 1 1 1 1 0 1 ...
## $ BsmtHalfBath : int 0 1 0 0 0 0 0 0 0 0 ...
## $ FullBath : int 2 2 2 1 2 1 2 2 2 1 ...
## $ HalfBath : int 1 0 1 0 1 1 0 1 0 0 ...
## $ BedroomAbvGr : int 3 3 3 3 4 1 3 3 2 2 ...
## $ KitchenAbvGr : int 1 1 1 1 1 1 1 1 2 2 ...
## $ KitchenQual : chr "Gd" "TA" "Gd" "Gd" ...
## $ TotRmsAbvGrd : int 8 6 6 7 9 5 7 7 8 5 ...
## $ Functional : chr "Typ" "Typ" "Typ" "Typ" ...
## $ Fireplaces : int 0 1 1 1 1 0 1 2 2 2 ...
## $ FireplaceQu : chr NA "TA" "TA" "Gd" ...
## $ GarageType : chr "Attchd" "Attchd" "Attchd" "Detchd" ...
## $ GarageYrBlt : int 2003 1976 2001 1998 2000 1993 2004 1973 1931 1939 ...
## $ GarageFinish : chr "RFn" "RFn" "RFn" "Unf" ...
## $ GarageCars : int 2 2 2 3 3 2 2 2 2 1 ...
## $ GarageArea : int 548 460 608 642 836 480 636 484 468 205 ...
## $ GarageQual : chr "TA" "TA" "TA" "TA" ...
## $ GarageCond : chr "TA" "TA" "TA" "TA" ...
## $ PavedDrive : chr "Y" "Y" "Y" "Y" ...
## $ WoodDeckSF : int 0 298 0 0 192 40 255 235 90 0 ...
## $ OpenPorchSF : int 61 0 42 35 84 30 57 204 0 4 ...
## $ EnclosedPorch: int 0 0 0 272 0 0 0 228 205 0 ...
## $ X3SsnPorch : int 0 0 0 0 0 320 0 0 0 0 ...
## $ ScreenPorch : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PoolArea : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PoolQC : chr NA NA NA NA ...
## $ Fence : chr NA NA NA NA ...
## $ MiscFeature : chr NA NA NA NA ...
## $ MiscVal : int 0 0 0 0 0 700 0 350 0 0 ...
## $ MoSold : int 2 5 9 2 12 10 8 11 4 1 ...
## $ YrSold : int 2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
## $ SaleType : chr "WD" "WD" "WD" "WD" ...
## $ SaleCondition: chr "Normal" "Normal" "Normal" "Abnorml" ...
## $ SalePrice : int 208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...
sum(sapply(train[,1:81], typeof) == "character")
## [1] 43
#Count the number of columns that consists of numerical data
sum(sapply(train[,1:81], typeof) == "integer")
## [1] 38
We have 43 columns that consist of text and 38 columns are numerical.let’s take a look at some descriptive statistics
# Obtain summary statistics
summary(train[,sapply(train[,1:81], typeof) == "integer"])
## Id MSSubClass LotFrontage LotArea
## Min. : 1.0 Min. : 20.0 Min. : 21.00 Min. : 1300
## 1st Qu.: 365.8 1st Qu.: 20.0 1st Qu.: 59.00 1st Qu.: 7554
## Median : 730.5 Median : 50.0 Median : 69.00 Median : 9478
## Mean : 730.5 Mean : 56.9 Mean : 70.05 Mean : 10517
## 3rd Qu.:1095.2 3rd Qu.: 70.0 3rd Qu.: 80.00 3rd Qu.: 11602
## Max. :1460.0 Max. :190.0 Max. :313.00 Max. :215245
## NA's :259
## OverallQual OverallCond YearBuilt YearRemodAdd
## Min. : 1.000 Min. :1.000 Min. :1872 Min. :1950
## 1st Qu.: 5.000 1st Qu.:5.000 1st Qu.:1954 1st Qu.:1967
## Median : 6.000 Median :5.000 Median :1973 Median :1994
## Mean : 6.099 Mean :5.575 Mean :1971 Mean :1985
## 3rd Qu.: 7.000 3rd Qu.:6.000 3rd Qu.:2000 3rd Qu.:2004
## Max. :10.000 Max. :9.000 Max. :2010 Max. :2010
##
## MasVnrArea BsmtFinSF1 BsmtFinSF2 BsmtUnfSF
## Min. : 0.0 Min. : 0.0 Min. : 0.00 Min. : 0.0
## 1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.: 0.00 1st Qu.: 223.0
## Median : 0.0 Median : 383.5 Median : 0.00 Median : 477.5
## Mean : 103.7 Mean : 443.6 Mean : 46.55 Mean : 567.2
## 3rd Qu.: 166.0 3rd Qu.: 712.2 3rd Qu.: 0.00 3rd Qu.: 808.0
## Max. :1600.0 Max. :5644.0 Max. :1474.00 Max. :2336.0
## NA's :8
## TotalBsmtSF X1stFlrSF X2ndFlrSF LowQualFinSF
## Min. : 0.0 Min. : 334 Min. : 0 Min. : 0.000
## 1st Qu.: 795.8 1st Qu.: 882 1st Qu.: 0 1st Qu.: 0.000
## Median : 991.5 Median :1087 Median : 0 Median : 0.000
## Mean :1057.4 Mean :1163 Mean : 347 Mean : 5.845
## 3rd Qu.:1298.2 3rd Qu.:1391 3rd Qu.: 728 3rd Qu.: 0.000
## Max. :6110.0 Max. :4692 Max. :2065 Max. :572.000
##
## GrLivArea BsmtFullBath BsmtHalfBath FullBath
## Min. : 334 Min. :0.0000 Min. :0.00000 Min. :0.000
## 1st Qu.:1130 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:1.000
## Median :1464 Median :0.0000 Median :0.00000 Median :2.000
## Mean :1515 Mean :0.4253 Mean :0.05753 Mean :1.565
## 3rd Qu.:1777 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:2.000
## Max. :5642 Max. :3.0000 Max. :2.00000 Max. :3.000
##
## HalfBath BedroomAbvGr KitchenAbvGr TotRmsAbvGrd
## Min. :0.0000 Min. :0.000 Min. :0.000 Min. : 2.000
## 1st Qu.:0.0000 1st Qu.:2.000 1st Qu.:1.000 1st Qu.: 5.000
## Median :0.0000 Median :3.000 Median :1.000 Median : 6.000
## Mean :0.3829 Mean :2.866 Mean :1.047 Mean : 6.518
## 3rd Qu.:1.0000 3rd Qu.:3.000 3rd Qu.:1.000 3rd Qu.: 7.000
## Max. :2.0000 Max. :8.000 Max. :3.000 Max. :14.000
##
## Fireplaces GarageYrBlt GarageCars GarageArea
## Min. :0.000 Min. :1900 Min. :0.000 Min. : 0.0
## 1st Qu.:0.000 1st Qu.:1961 1st Qu.:1.000 1st Qu.: 334.5
## Median :1.000 Median :1980 Median :2.000 Median : 480.0
## Mean :0.613 Mean :1979 Mean :1.767 Mean : 473.0
## 3rd Qu.:1.000 3rd Qu.:2002 3rd Qu.:2.000 3rd Qu.: 576.0
## Max. :3.000 Max. :2010 Max. :4.000 Max. :1418.0
## NA's :81
## WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00
## Median : 0.00 Median : 25.00 Median : 0.00 Median : 0.00
## Mean : 94.24 Mean : 46.66 Mean : 21.95 Mean : 3.41
## 3rd Qu.:168.00 3rd Qu.: 68.00 3rd Qu.: 0.00 3rd Qu.: 0.00
## Max. :857.00 Max. :547.00 Max. :552.00 Max. :508.00
##
## ScreenPorch PoolArea MiscVal MoSold
## Min. : 0.00 Min. : 0.000 Min. : 0.00 Min. : 1.000
## 1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 5.000
## Median : 0.00 Median : 0.000 Median : 0.00 Median : 6.000
## Mean : 15.06 Mean : 2.759 Mean : 43.49 Mean : 6.322
## 3rd Qu.: 0.00 3rd Qu.: 0.000 3rd Qu.: 0.00 3rd Qu.: 8.000
## Max. :480.00 Max. :738.000 Max. :15500.00 Max. :12.000
##
## YrSold SalePrice
## Min. :2006 Min. : 34900
## 1st Qu.:2007 1st Qu.:129975
## Median :2008 Median :163000
## Mean :2008 Mean :180921
## 3rd Qu.:2009 3rd Qu.:214000
## Max. :2010 Max. :755000
##
As we can see from the summary statistics above, there are some outliers needs to be proceessed. It would be clear to identify which dependant variables that need processing after visualization.
# The percentage of data missing in train
cat(sprintf("Percentage of missing values in the overall train dataset: %s%s\n", round(length(which(is.na(train) == TRUE)) * 100 / (nrow(train) * ncol(train)), 2), "%"))
## Percentage of missing values in the overall train dataset: 5.89%
Creating one training dataset with categorical variable and one with numeric variable. We will use this for data visualization.
cat_var <- names(train)[which(sapply(train, is.character))]
cat_car <- c(cat_var, 'BedroomAbvGr', 'HalfBath', ' KitchenAbvGr','BsmtFullBath', 'BsmtHalfBath', 'MSSubClass')
numeric_var <- names(train)[which(sapply(train, is.numeric))]
train1_cat<-train[cat_var]
train1_num<-train[numeric_var]
library(ggplot2)
## Bar plot/Density plot function
## Bar plot function
plotHist <- function(data_in, i)
{
data <- data.frame(x=data_in[[i]])
p <- ggplot(data=data, aes(x=factor(x))) + stat_count() + xlab(colnames(data_in)[i]) + theme_light() +
theme(axis.text.x = element_text(angle = 90, hjust =1))
return (p)
}
## Density plot function
plotDen <- function(data_in, i){
data <- data.frame(x=data_in[[i]], SalePrice = data_in$SalePrice)
p <- ggplot(data= data) + geom_line(aes(x = x), stat = 'density', size = 1,alpha = 1.0) +
xlab(paste0((colnames(data_in)[i]), '\n', 'Skewness: ',round(skewness(data_in[[i]], na.rm = TRUE), 2))) + theme_light()
return(p)
}
## Function to call both Bar plot and Density plot function
doPlots <- function(data_in, fun, ii, ncol=3)
{
pp <- list()
for (i in ii) {
p <- fun(data_in=data_in, i=i)
pp <- c(pp, list(p))
}
do.call("grid.arrange", c(pp, ncol=ncol))
}
The bar plots below offer more insight into the data. MSZoning: bar plot indicates that majority of the houses are located in low density residential areas and medium density residential area.
The type of road access to the property tends to be paved and the houses do not have alleys.
Landcontour: the houses are built on flat properties Utilities: Almost all homes have all public utilities (E,G,W, & S) LandSlope: most of the properties have a gentle slope
## Barplots for the categorical features
doPlots(train1_cat, fun = plotHist, ii = 1:11, ncol = 4)
doPlots(train1_cat, fun = plotHist, ii = 12:22, ncol = 4)
The houses that have sever landslope are located in the Clear Creek and Timberland. The houses with moderate landslope are present in more neighborhood. The Clear Creek and the Crawford neighborhoods seem to have high slopes.
train %>%
select(LandSlope, Neighborhood, SalePrice) %>%
filter(LandSlope == c('Sev', 'Mod')) %>%
arrange(Neighborhood) %>%
group_by(Neighborhood, LandSlope) %>%
summarize(Count = n()) %>%
ggplot(aes(Neighborhood, Count)) +
geom_bar(aes(fill = LandSlope), position = 'dodge', stat = 'identity') +
theme_light() +
theme(axis.text.x = element_text(angle = 90, hjust =1))
ggplot(train, aes(x = Neighborhood, y = SalePrice)) +
geom_boxplot() +
geom_hline(aes(yintercept=80),
colour='red', linetype='dashed', lwd=2) +
scale_y_continuous(labels=dollar_format()) +
theme(axis.text.x = element_text(angle = 90))
Boxplot between the neighboorhoods and sale price shows that BrookSide and South & West of Iowa State University have cheap houses. While Northridge and Northridge Heights are rich neighborhoods with several outliers in terms of price.
doPlots(train1_num, fun = plotDen, ii = 2:23, ncol = 4)
## Warning: Removed 259 rows containing non-finite values (stat_density).
## Warning: Removed 8 rows containing non-finite values (stat_density).
library(psych)
##
## Attaching package: 'psych'
## The following object is masked from 'package:randomForest':
##
## outlier
## The following objects are masked from 'package:scales':
##
## alpha, rescale
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
correlations <- cor(na.omit(train1_num[,-1])) # omit the id
row_indic <- apply(correlations, 1, function(x) sum(x > 0.3 | x < -0.3) > 1)
correlations<- correlations[row_indic ,row_indic ]
corrplot(correlations, method="square")
ggcorr(correlations, palette = "RdBu", label = TRUE)
# mat : is a matrix of data
# ... : further arguments to pass to the native R cor.test function
cor.mtest <- function(mat, ...) {
mat <- as.matrix(mat)
n <- ncol(mat)
p.mat<- matrix(NA, n, n)
diag(p.mat) <- 0
for (i in 1:(n - 1)) {
for (j in (i + 1):n) {
tmp <- cor.test(mat[, i], mat[, j], ...)
p.mat[i, j] <- p.mat[j, i] <- tmp$p.value
}
}
colnames(p.mat) <- rownames(p.mat) <- colnames(mat)
p.mat
}
# matrix of the p-value of the correlation
p.mat <- cor.mtest(train1_num[,-1])
(p.mat[, 1:5])
## MSSubClass LotFrontage LotArea OverallQual
## MSSubClass 0.000000e+00 4.846634e-44 8.194126e-08 2.127766e-01
## LotFrontage 4.846634e-44 0.000000e+00 3.697417e-54 8.391777e-19
## LotArea 8.194126e-08 3.697417e-54 0.000000e+00 5.105509e-05
## OverallQual 2.127766e-01 8.391777e-19 5.105509e-05 0.000000e+00
## OverallCond 2.342025e-02 4.019570e-02 8.296279e-01 4.361720e-04
## YearBuilt 2.875793e-01 1.813802e-05 5.869923e-01 8.400355e-128
## YearRemodAdd 1.211631e-01 2.052217e-03 5.985903e-01 1.514450e-116
## MasVnrArea 3.824712e-01 1.536390e-11 6.995925e-05 1.487730e-60
## BsmtFinSF1 7.598820e-03 2.364202e-16 1.340256e-16 1.610533e-20
## BsmtFinSF2 1.210779e-02 8.388577e-02 2.068606e-05 2.388567e-02
## BsmtUnfSF 6.637589e-08 3.979767e-06 9.203746e-01 1.732033e-33
## TotalBsmtSF 2.473278e-20 2.046592e-45 3.911258e-24 3.130271e-110
## X1stFlrSF 1.526459e-22 4.498144e-63 1.234238e-31 1.623197e-83
## X2ndFlrSF 1.985216e-33 5.433170e-03 5.144266e-02 8.317033e-31
## LowQualFinSF 7.586521e-02 1.827783e-01 8.552305e-01 2.452458e-01
## GrLivArea 4.213734e-03 4.612423e-48 1.520481e-24 2.247511e-139
## BsmtFullBath 8.939735e-01 4.587930e-04 1.230800e-09 2.094407e-05
## BsmtHalfBath 9.290423e-01 8.022392e-01 6.646033e-02 1.251676e-01
## FullBath 4.502112e-07 3.634485e-12 1.360045e-06 1.669019e-116
## HalfBath 8.788166e-12 6.365788e-02 5.861562e-01 1.872117e-26
## BedroomAbvGr 3.708290e-01 1.780460e-20 4.519871e-06 9.949911e-05
## KitchenAbvGr 4.852405e-28 8.335889e-01 4.971427e-01 1.435624e-12
## TotRmsAbvGrd 1.230183e-01 2.253852e-36 2.460712e-13 6.425172e-66
## Fireplaces 8.175110e-02 5.373474e-21 4.632098e-26 3.086481e-56
## GarageYrBlt 1.566879e-03 1.834203e-02 3.545896e-01 8.371655e-109
## GarageCars 1.255479e-01 5.428777e-24 2.706813e-09 7.037163e-144
## GarageArea 1.592242e-04 6.734807e-35 3.802721e-12 2.434288e-122
## WoodDeckSF 6.310397e-01 2.136524e-03 4.001008e-11 2.126361e-20
## OpenPorchSF 8.158485e-01 1.210210e-07 1.185832e-03 1.245035e-33
## EnclosedPorch 6.458454e-01 7.110477e-01 4.837906e-01 1.276996e-05
## X3SsnPorch 9.414954e-02 1.520968e-02 4.355273e-01 2.461582e-01
## ScreenPorch 3.202578e-01 1.517836e-01 9.924777e-02 1.314582e-02
## PoolArea 7.518384e-01 5.393081e-13 2.979905e-03 1.275642e-02
## MiscVal 7.692691e-01 9.071905e-01 1.459891e-01 2.304122e-01
## MoSold 6.040063e-01 6.982030e-01 9.633078e-01 6.790956e-03
## YrSold 4.137257e-01 7.964813e-01 5.861053e-01 2.963852e-01
## SalePrice 1.266472e-03 2.602442e-36 1.123139e-24 2.185675e-313
## OverallCond
## MSSubClass 2.342025e-02
## LotFrontage 4.019570e-02
## LotArea 8.296279e-01
## OverallQual 4.361720e-04
## OverallCond 0.000000e+00
## YearBuilt 3.090962e-50
## YearRemodAdd 4.816090e-03
## MasVnrArea 9.716412e-07
## BsmtFinSF1 7.741099e-02
## BsmtFinSF2 1.244256e-01
## BsmtUnfSF 1.530090e-07
## TotalBsmtSF 4.685261e-11
## X1stFlrSF 3.126348e-08
## X2ndFlrSF 2.690898e-01
## LowQualFinSF 3.303245e-01
## GrLivArea 2.311138e-03
## BsmtFullBath 3.580810e-02
## BsmtHalfBath 6.367488e-06
## FullBath 7.242238e-14
## HalfBath 2.022485e-02
## BedroomAbvGr 6.202024e-01
## KitchenAbvGr 8.754658e-04
## TotRmsAbvGrd 2.779335e-02
## Fireplaces 3.630812e-01
## GarageYrBlt 3.914853e-35
## GarageCars 8.425076e-13
## GarageArea 5.945338e-09
## WoodDeckSF 8.987253e-01
## OpenPorchSF 2.133227e-01
## EnclosedPorch 7.159420e-03
## X3SsnPorch 3.301473e-01
## ScreenPorch 3.625216e-02
## PoolArea 9.395944e-01
## MiscVal 8.568159e-03
## MoSold 8.933753e-01
## YrSold 9.321243e-02
## SalePrice 2.912351e-03
# Leave blank on no significant coefficient
corrplot(correlations, type="upper", order="hclust",
p.mat = p.mat, sig.level = 0.05, method = 'pie' ,insig = "blank")
plotCorr <- function(data_in, i){
data <- data.frame(x = data_in[[i]], SalePrice = data_in$SalePrice)
p <- ggplot(data, aes(x = x, y = SalePrice)) +
geom_point(shape = 1, na.rm = TRUE) +
geom_smooth(method = lm ) +
xlab(paste0(colnames(data_in)[i], '\n', 'R-Squared: ', round(cor(data_in[[i]], data$SalePrice, use = 'complete.obs'), 2))) +
theme_light()
return(suppressWarnings(p))
}
highcorr <- c(names(correlations[,'SalePrice'])[which(correlations[,'SalePrice'] > 0.5)], names(correlations[,'SalePrice'])[which(correlations[,'SalePrice'] < -0.2)])
data_corr <- train[,highcorr]
doPlots(data_corr, fun = plotCorr, ii = 1:11)
The histogram for the response variable SalePrice shows that it is skewed. Taking the log of the variable normalizes it.
library(scales)
ggplot(train, aes(x=SalePrice)) +
geom_histogram(col = 'white') +
theme_light() +
scale_x_continuous(labels = comma)
#Normalize distribution
ggplot(train, aes(x=log(SalePrice+1))) +
geom_histogram(col = 'white') +
theme_light()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
signif_col <- train1_num %>%
select(YearBuilt, FullBath, OverallQual, GrLivArea, TotRmsAbvGrd, X2ndFlrSF, SalePrice)
# options(repr.plot.width = 20, repr.plot.height = 20)
ggpairs(signif_col)
corr <- train1_num %>%
select(SalePrice, TotRmsAbvGrd, GrLivArea)
ggcorr(tidyr::drop_na(corr), palette = "RdBu", label = TRUE)
Test the hypotheses:
(tot_vs_sal <- cor.test(corr$TotRmsAbvGrd, corr$SalePrice, method = 'pearson'))
##
## Pearson's product-moment correlation
##
## data: corr$TotRmsAbvGrd and corr$SalePrice
## t = 24.099, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4960020 0.5694337
## sample estimates:
## cor
## 0.5337232
(gr_vs_tot <- cor.test(corr$GrLivArea, corr$TotRmsAbvGrd, method = 'pearson'))
##
## Pearson's product-moment correlation
##
## data: corr$GrLivArea and corr$TotRmsAbvGrd
## t = 55.846, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8084234 0.8411686
## sample estimates:
## cor
## 0.8254894
(gr_vs_sal <- cor.test(corr$GrLivArea, corr$SalePrice, method = 'pearson'))
##
## Pearson's product-moment correlation
##
## data: corr$GrLivArea and corr$SalePrice
## t = 38.348, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6821200 0.7332695
## sample estimates:
## cor
## 0.7086245
The p-value in the three varaibles is below significance level of 0.05 - around 2.2e-16. This implies that the correlation falls within the c(0.8, 0.2) confidence interval and we failed to accept the null hypothese which assumes that the correlation is zero.
The 80% confidence interval
library(psychometric)
(CIr(r = tot_vs_sal$estimate, n = nrow(corr), level = 0.8))
## [1] 0.5092841 0.5573021
(CIr(r = gr_vs_sal$estimate, n = nrow(corr), level = 0.8))
## [1] 0.6915087 0.7249450
(CIr(r = gr_vs_tot$estimate, n = nrow(corr), level = 0.8))
## [1] 0.8144931 0.8358928
To calculate the familywize error (type-I error) \(FWE <= 1-(1-alpha)^c\) where:
1. \(alpha\) is the alpha level for an individual test - in this case it is 0.05. 2. \(c\) is the number of comparisons, which is 3.
perc <- (1 - (1 - 0.05)^3)
cat(sprintf("The probability of a type-I error is : %s%s\n", perc, " "))
## The probability of a type-I error is : 0.142625
Yes, I would be worry about that probability, which is very high considering only three tests were performed.
Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.
cor_mat <- data.matrix(cor(corr))
rownames(cor_mat) <- c()
colnames(cor_mat) <- c()
(cor_mat2 <- round(cor_mat, 2))
## [,1] [,2] [,3]
## [1,] 1.00 0.53 0.71
## [2,] 0.53 1.00 0.83
## [3,] 0.71 0.83 1.00
The determinant of the correlation matrix not zero, the matrix has an inverse
library(matlib)
det(cor_mat2) # get determinant
## [1] 0.150758
(cor_inv <- inv(cor_mat2)) # get the inverse matrix
## [,1] [,2] [,3]
## [1,] 2.0635721 0.3933456 -1.791613
## [2,] 0.3933456 3.2893777 -3.009459
## [3,] -1.7916131 -3.0094589 4.769896
(by_inv <- cor_mat2 %*% cor_inv)
## [,1] [,2] [,3]
## [1,] 1.000000e+00 2.7e-09 -5.9e-09
## [2,] 9.000001e-10 1.0e+00 -6.9e-09
## [3,] -3.000000e-10 1.7e-09 1.0e+00
(res <- round(by_inv, 1))
## [,1] [,2] [,3]
## [1,] 1 0 0
## [2,] 0 1 0
## [3,] 0 0 1
get the inverse
library(matrixcalc)
# LU decomposition
(cor_lu <- lu.decomposition(cor_mat2))
## $L
## [,1] [,2] [,3]
## [1,] 1.00 0.0000000 0
## [2,] 0.53 1.0000000 0
## [3,] 0.71 0.6309275 1
##
## $U
## [,1] [,2] [,3]
## [1,] 1 0.5300 0.7100000
## [2,] 0 0.7191 0.4537000
## [3,] 0 0.0000 0.2096482
# multiply L and U
#
(cor_lu$L %*% cor_lu$U)
## [,1] [,2] [,3]
## [1,] 1.00 0.53 0.71
## [2,] 0.53 1.00 0.83
## [3,] 0.71 0.83 1.00
Calculus-Based Probability & Statistics. Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of λ for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, λ)). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
fit_data <- train$MasVnrArea
fit_data <- fit_data[complete.cases(fit_data)]
hist(fit_data)
summary(fit_data)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.0 0.0 103.7 166.0 1600.0
length(fit_data[fit_data == 0])
## [1] 861
Out of the 1452 houses, there are 861 that have no masonry veneers (area is zero).
Because the data measures area, adding a value of .01 should be negligible and would get rid of the zero values. A property with a masonry veneer area of .01 square feet would mean this property does not really have any masonry veneer.
fit_data <- fit_data + .01
hist(fit_data)
summary(fit_data)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01 0.01 0.01 103.70 166.01 1600.01
library(MASS)
fit <- fitdistr(fit_data, densfun = "exponential")
# the optimal value
fit$estimate
## rate
## 0.009643642
exp <- rexp(length(fit_data), rate = fit$estimate)
hist(exp)
qexp(.05, rate=fit$estimate)
## [1] 5.318872
qexp(.95, rate=fit$estimate)
## [1] 310.6432
norm.interval = function(data, variance = var(data), conf.level = 0.95)
{
z = qnorm((1 - conf.level)/2, lower.tail = FALSE)
xbar = mean(data)
sdx = sqrt(variance/length(data))
c(xbar - z * sdx, xbar + z * sdx)
}
norm.interval(fit_data, variance=var(fit_data), conf.level = 0.95)
## [1] 94.38199 113.00853
quantile(x=fit_data, probs=c(.05, .95))
## 5% 95%
## 0.01 456.01